The life expectancy dataset is a collection of data from countries around the globe, providing information on factors that can influence life expectancy. It consists of 22 variables, including country, year, life expectancy, GDP, population, and schooling.
The data spans the years 2000 to 2015. Its sources include the World Health Organization (WHO), the United Nations (UN), and the World Bank (WB). The dataset covers 193 countries, making it a comprehensive source of information on global health trends.
The dataset is available in CSV format and is easy to load for analysis. It is used to study the impact of socioeconomic factors on health outcomes such as life expectancy, and it enables us to study the relationships among factors such as income, education, and health. It also provides insight into disparities in health outcomes between countries and regions. The dataset has been pre-processed and cleaned to improve consistency and accuracy, although, as the `df.info()` output later shows, several columns still contain missing values.
The world income inequality dataset comprises multiple tables, each containing observations for individual countries over a span of around 50 years. The table "income-share-of-the-top-10-pip" contains the columns Country, Code, Year, and the share of the richest decile in total income or expenditure. The table "income-shares-by-quintile" contains Country, Code, Year, and the share of each quintile in total income or expenditure. The data was collected from a source listed in the citations section and is easily accessible for analysis. It is used to study inequality within countries over the years by comparing their richest and poorest quintiles. The dataset has already been pre-processed and cleaned to ensure that the data is consistent and accurate.
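The richest-versus-poorest comparison described above can be sketched on a toy table. This is only an illustration: the column names `richest_quintile_share` and `poorest_quintile_share` and the helper `quintile_ratio` are hypothetical, and the real tables may label these columns differently.

```python
import pandas as pd


def quintile_ratio(shares: pd.DataFrame) -> pd.DataFrame:
    """Add the ratio of the richest to the poorest quintile's income share."""
    out = shares.copy()
    out["ratio"] = out["richest_quintile_share"] / out["poorest_quintile_share"]
    return out


# Toy values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2000, 2010, 2010],
    "richest_quintile_share": [50.0, 45.0, 40.0],  # hypothetical column name
    "poorest_quintile_share": [5.0, 5.0, 8.0],     # hypothetical column name
})

result = quintile_ratio(toy)
print(result[["Country", "Year", "ratio"]])
```

A larger ratio means the richest quintile holds a larger multiple of the poorest quintile's share, so tracking it per country over the years gives one simple view of how inequality changes.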
The main objective of our project is to understand and visualize the disparity in life expectancy among countries.
For this, we consider parameters such as development status (developed or developing), schooling, and GDP.
In this project we analyze the life expectancy data of different countries and explore the factors that affect it. We use data analysis techniques such as hypothesis testing, time series forecasting, and clustering to gain insights and make informed decisions based on the data.
Through this project we also aim to create interactive visualizations that communicate the findings and give a better understanding of the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.offline as py
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('Life Expectancy Data.csv')
df
| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.259210 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.470 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959000 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2933 | Zimbabwe | 2004 | Developing | 44.3 | 723.0 | 27 | 4.36 | 0.000000 | 68.0 | 31 | ... | 67.0 | 7.13 | 65.0 | 33.6 | 454.366654 | 12777511.0 | 9.4 | 9.4 | 0.407 | 9.2 |
| 2934 | Zimbabwe | 2003 | Developing | 44.5 | 715.0 | 26 | 4.06 | 0.000000 | 7.0 | 998 | ... | 7.0 | 6.52 | 68.0 | 36.7 | 453.351155 | 12633897.0 | 9.8 | 9.9 | 0.418 | 9.5 |
| 2935 | Zimbabwe | 2002 | Developing | 44.8 | 73.0 | 25 | 4.43 | 0.000000 | 73.0 | 304 | ... | 73.0 | 6.53 | 71.0 | 39.8 | 57.348340 | 125525.0 | 1.2 | 1.3 | 0.427 | 10.0 |
| 2936 | Zimbabwe | 2001 | Developing | 45.3 | 686.0 | 25 | 1.72 | 0.000000 | 76.0 | 529 | ... | 76.0 | 6.16 | 75.0 | 42.1 | 548.587312 | 12366165.0 | 1.6 | 1.7 | 0.427 | 9.8 |
| 2937 | Zimbabwe | 2000 | Developing | 46.0 | 665.0 | 24 | 1.68 | 0.000000 | 79.0 | 1483 | ... | 78.0 | 7.10 | 78.0 | 43.5 | 547.358878 | 12222251.0 | 11.0 | 11.2 | 0.434 | 9.8 |
2938 rows × 22 columns
df.head()
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Country                          2938 non-null   object
 1   Year                             2938 non-null   int64
 2   Status                           2938 non-null   object
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64
 10  BMI                              2904 non-null   float64
 11  under-five deaths                2938 non-null   int64
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  thinness 1-19 years              2904 non-null   float64
 19  thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
| | Year | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2938.000000 | 2928.000000 | 2928.000000 | 2938.000000 | 2744.000000 | 2938.000000 | 2385.000000 | 2938.000000 | 2904.000000 | 2938.000000 | 2919.000000 | 2712.00000 | 2919.000000 | 2938.000000 | 2490.000000 | 2.286000e+03 | 2904.000000 | 2904.000000 | 2771.000000 | 2775.000000 |
| mean | 2007.518720 | 69.224932 | 164.796448 | 30.303948 | 4.602861 | 738.251295 | 80.940461 | 2419.592240 | 38.321247 | 42.035739 | 82.550188 | 5.93819 | 82.324084 | 1.742103 | 7483.158469 | 1.275338e+07 | 4.839704 | 4.870317 | 0.627551 | 11.992793 |
| std | 4.613841 | 9.523867 | 124.292079 | 117.926501 | 4.052413 | 1987.914858 | 25.070016 | 11467.272489 | 20.044034 | 160.445548 | 23.428046 | 2.49832 | 23.716912 | 5.077785 | 14270.169342 | 6.101210e+07 | 4.420195 | 4.508882 | 0.210904 | 3.358920 |
| min | 2000.000000 | 36.300000 | 1.000000 | 0.000000 | 0.010000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 3.000000 | 0.37000 | 2.000000 | 0.100000 | 1.681350 | 3.400000e+01 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 2004.000000 | 63.100000 | 74.000000 | 0.000000 | 0.877500 | 4.685343 | 77.000000 | 0.000000 | 19.300000 | 0.000000 | 78.000000 | 4.26000 | 78.000000 | 0.100000 | 463.935626 | 1.957932e+05 | 1.600000 | 1.500000 | 0.493000 | 10.100000 |
| 50% | 2008.000000 | 72.100000 | 144.000000 | 3.000000 | 3.755000 | 64.912906 | 92.000000 | 17.000000 | 43.500000 | 4.000000 | 93.000000 | 5.75500 | 93.000000 | 0.100000 | 1766.947595 | 1.386542e+06 | 3.300000 | 3.300000 | 0.677000 | 12.300000 |
| 75% | 2012.000000 | 75.700000 | 228.000000 | 22.000000 | 7.702500 | 441.534144 | 97.000000 | 360.250000 | 56.200000 | 28.000000 | 97.000000 | 7.49250 | 97.000000 | 0.800000 | 5910.806335 | 7.420359e+06 | 7.200000 | 7.200000 | 0.779000 | 14.300000 |
| max | 2015.000000 | 89.000000 | 723.000000 | 1800.000000 | 17.870000 | 19479.911610 | 99.000000 | 212183.000000 | 87.300000 | 2500.000000 | 99.000000 | 17.60000 | 99.000000 | 50.600000 | 119172.741800 | 1.293859e+09 | 27.700000 | 28.600000 | 0.948000 | 20.700000 |
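The non-null counts reported by `df.info()` show that several columns (for example GDP and Population) have gaps. As a minimal sketch of how such gaps can be counted and then filled with column means, the following uses a toy frame whose values are invented for illustration; it is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; values are not from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "GDP": [100.0, np.nan, 300.0, 500.0],
    "Schooling": [10.0, 12.0, np.nan, np.nan],
})

# Count missing values per column.
missing = toy.isna().sum()
print(missing)

# Replace missing numeric values with the column mean.
# numeric_only=True skips the text column, which has no mean.
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```

Mean imputation uses one global value per column, so it ignores differences between countries and years; it is a simple default rather than the only reasonable choice.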
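# Fill missing numeric values with each column's mean (text columns are skipped)
df = df.fillna(df.mean(numeric_only=True))
# Correlate numeric columns only; Country and Status are text
corr_matrix = df.corr(numeric_only=True)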
sns.heatmap(corr_matrix)
plt.show()
From this correlation matrix we can see that life expectancy is highly correlated with 'Income composition of resources' and with 'Schooling'.
It has a moderate correlation with 'Alcohol', 'percentage expenditure', 'BMI', selected diseases, and 'GDP', whereas it is only very weakly correlated with factors such as 'thinness' and 'infant deaths'.
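Reading correlation strength off a heatmap by eye can be imprecise, so it may help to list the coefficients directly. The sketch below is an illustration on invented numbers; the helper `rank_correlations` is hypothetical and not part of the notebook.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Life expectancy": [50.0, 60.0, 70.0, 80.0],
    "Schooling": [5.0, 8.0, 11.0, 14.0],              # rises with the target
    "Adult Mortality": [400.0, 300.0, 200.0, 100.0],  # falls as the target rises
    "Noise": [1.0, -1.0, -1.0, 1.0],                  # unrelated to the target
})

ranked = rank_correlations(toy, "Life expectancy")
print(ranked)
```

Ranking by absolute value matters because a strongly negative coefficient indicates a strong relationship, not a weak one.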
Next, we draw a box plot of life expectancy against the development status of each country.
import seaborn as sns
sns.boxplot(x='Status', y='Life expectancy ', data=df)
plt.xlabel('Status')
plt.ylabel('Life expectancy')
plt.title('Distribution of life expectancy by status')
plt.show()
From the above box plot we can conclude that developed countries have a higher median life expectancy than developing countries, with less variability in the distribution. The box plot also shows that developing countries span a wider range of life expectancies, with some countries having very low values. This may reflect the wide range of development levels among countries classified as developing.
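The median and spread that the box plot conveys can also be computed as numbers. The following is a minimal sketch on invented values; note that the toy column is named `Life expectancy` without the trailing space used in the notebook's code.

```python
import pandas as pd


def iqr(series: pd.Series) -> float:
    """Interquartile range: the height of the box in a box plot."""
    return series.quantile(0.75) - series.quantile(0.25)


# Toy values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "Status": ["Developed"] * 3 + ["Developing"] * 3,
    "Life expectancy": [78.0, 80.0, 82.0, 50.0, 65.0, 74.0],
})

# Median and interquartile range of life expectancy per development status.
summary = toy.groupby("Status")["Life expectancy"].agg(median="median", iqr=iqr)
print(summary)
```

A higher median with a smaller interquartile range for one group corresponds to the "higher and less variable" pattern described above.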
import plotly.graph_objs as go

def get_top6_countries(df, year):
    # Despite the name, this returns the six countries with the LOWEST
    # life expectancy in the given year.
    df_year = df[df['Year'] == year]
    return df_year.nsmallest(6, 'Life expectancy ')

years = df['Year'].unique().tolist()
fig = go.Figure()

# Initial bar trace for the first year in the list
df_initial = get_top6_countries(df, years[0])
fig.add_trace(go.Bar(x=df_initial['Country'], y=df_initial['Life expectancy '], name=str(years[0])))
fig.update_layout(title='Countries with the least Life Expectancy', xaxis_title='Country', yaxis_title='Life Expectancy (years)',
                  xaxis={'type': 'category'}, yaxis_range=[0, 100], barmode='group')

slider = dict(
    active=0,
    pad={"t": 50},
    steps=[],
)

# One slider step per year: swap in that year's countries and values
for year in years:
    df_top6 = get_top6_countries(df, year)
    countries = df_top6['Country'].tolist()
    slider['steps'].append(dict(label=str(year), method="update", args=[{"x": [countries], "y": [df_top6['Life expectancy '].tolist()], "visible": [True]},
                                                                         {"title": str(year)}]))

fig.update_layout(sliders=[slider])
# Display the figure
fig.show()
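The per-year selection that feeds the slider can be checked without plotting. The sketch below is one possible way to do this on invented values; the helper `lowest_n_per_year` is hypothetical and uses a sort plus `groupby(...).head(n)` instead of the notebook's year-by-year `nsmallest` call.

```python
import pandas as pd


def lowest_n_per_year(frame: pd.DataFrame, n: int, value_col: str) -> pd.DataFrame:
    """Return the n rows with the smallest `value_col` within each year."""
    return (
        frame.sort_values(["Year", value_col])
        .groupby("Year")
        .head(n)
        .reset_index(drop=True)
    )


# Toy values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Life expectancy": [50.0, 70.0, 60.0, 72.0, 51.0, 61.0],
})

lowest = lowest_n_per_year(toy, 2, "Life expectancy")
print(lowest)
```

Inspecting a table like this makes it easy to confirm that each slider step shows the countries with the lowest values for that year.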